ENH: Sparse int64 and bool dtype support enhancement #13849

sinhrks · 2016-07-30T10:26:34Z

- closes Support dtypes other than float in sparse data structures #667
- closes BUG: int sparse series is converted to float if sliced #8292
- closes _combine_const() in pandas.sparse.frame does not have uniform method signature with pandas.core.frame #13001
- closes SparseDataFrame.isnull raises an error #8276
- closes BUG: SparseSeries/DataFrame non-float dtypes repr #13110

- tests added / passed
- passes git diff upstream/master | flake8 --diff
- whatsnew entry

Currently, sparse doesn't actually support int64 and bool dtypes. When int or bool values are passed, they are coerced to float64 unless the dtype keyword is explicitly specified.

On current master:

pd.SparseArray([1, 2, 0, 0])
# [1.0, 2.0, 0.0, 0.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2, 3], dtype=int32)

pd.SparseArray([True, False, True])
# [1.0, 0.0, 1.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2], dtype=int32)

After this PR:

The created data should have the dtype of the passed values (the same as a normal Series).

pd.SparseArray([1, 2, 0, 0])
# [1, 2, 0, 0]
# Fill: 0
# IntIndex
# Indices: array([0, 1], dtype=int32)

pd.SparseArray([True, False, True])
# [True, False, True]
# Fill: False
# IntIndex
# Indices: array([0, 2], dtype=int32)

Also, fill_value is automatically specified according to the following rules (because np.nan cannot appear in int or bool dtype):

Basic rule: the dtype must not change when sparse data is converted to dense.

If sparse_index is specified and data has a hole (missing values):
- fill_value is np.nan
- dtype is float64 or object (which can store both data and fill_value)

If sparse_index is None (all values are provided via data, no missing values):
- if fill_value is not explicitly passed, the following default is used depending on the dtype:
  - float: np.nan
  - int: 0
  - bool: False
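
The default rule above can be sketched in plain Python, without pandas. This is only an illustration under stated assumptions: `default_fill_value` and `to_sparse_parts` are hypothetical helpers, not part of the pandas API, and they model a sparse array simply as its stored values, their integer positions, and the fill value.

```python
import math


def default_fill_value(values):
    """Hypothetical helper: choose a default fill value from the element types."""
    # bool is checked first because bool is a subclass of int in Python.
    if values and all(isinstance(v, bool) for v in values):
        return False
    if values and all(isinstance(v, int) for v in values):
        return 0
    # Floats (and anything else in this sketch) fall back to NaN.
    return math.nan


def to_sparse_parts(values):
    """Hypothetical helper: return (stored values, their positions, fill value)."""
    fill = default_fill_value(values)

    def is_fill(v):
        # NaN never compares equal to itself, so it needs a dedicated check.
        if isinstance(fill, float) and math.isnan(fill):
            return isinstance(v, float) and math.isnan(v)
        return v == fill

    indices = [i for i, v in enumerate(values) if not is_fill(v)]
    return [values[i] for i in indices], indices, fill


print(to_sparse_parts([1, 2, 0, 0]))          # int input: fill value 0
print(to_sparse_parts([True, False, True]))   # bool input: fill value False
print(to_sparse_parts([1.0, 2.0, 0.0, 0.0]))  # float input: fill value NaN
```

With these inputs the stored positions come out as [0, 1], [0, 2] and [0, 1, 2, 3], which is consistent with the Indices shown in the examples above.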

codecov-io · 2016-07-30T12:31:23Z

Current coverage is 85.27% (diff: 98.63%)

Merging #13849 into master will increase coverage by <.01%

@@             master     #13849   diff @@
==========================================
  Files           139        139          
  Lines         50511      50523    +12   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43071      43083    +12   
  Misses         7440       7440          
  Partials          0          0

Powered by Codecov. Last update 10bf721...341585a

split from #13849

Author: sinhrks <sinhrks@gmail.com>

Closes #13900 from sinhrks/sparse_astype and squashes the following commits:

1c669ad [sinhrks] ENH: sparse astype now supports int64 and bool

jreback · 2016-08-07T00:04:49Z

@sinhrks getting tons of warnings compiling on Windows, all the same:

pandas\src\sparse.c(63861) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(63870) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66180) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66189) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(68499) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data

closes #13942 xref #13849

jorisvandenbossche · 2016-08-21T13:45:37Z

Sorry, not familiar with sparse. But: with object dtype, does it work well enough to be used in certain cases? If yes, I would not remove it.

jorisvandenbossche · 2016-08-21T18:19:33Z

@sinhrks Does this also close #13110?

sinhrks · 2016-08-21T22:11:53Z

I think object dtype can be used in some cases, but I'm not fully sure, as it is not well tested. I won't remove it ATM, and will add more tests to clarify (in another PR).

#13110 should be closed. Added whatsnew.

jreback · 2016-08-25T10:42:14Z

doc/source/whatsnew/v0.19.0.txt

@@ -17,6 +17,7 @@ Highlights include:
 - ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>`
 - pandas development api, see :ref:`here <whatsnew_0190.dev_api>`
 - ``PeriodIndex`` now has its own ``period`` dtype. see ref:`here <whatsnew_0190.api.perioddtype>`
+- Sparse now supports other ``int`` and ``bool`` dtypes, see :ref:`here <whatsnew_0190.sparse>`


I would leave out "other"

jorisvandenbossche · 2016-08-27T15:05:41Z

Disclaimer: I have never used sparse and am not familiar with the implementation (so my apologies if this is a stupid or naive question), but I quickly looked at the PR and have the following question.

Previously, for integer and boolean Series, the 0 or False values were regarded as actual values, not as an indication of 'not a value' in the sparse series. Isn't this a big change? (I don't know how usable this was before this PR, so it may not be a problem.)
Next to that, having e.g. False as the default fill_value for boolean arrays also seems a bit strange to me. I would expect that somebody who wants a boolean sparse array would want to be able to have both True and False as actual values (e.g. something like [True, -, -, False, -, -, True])?
Of course this is currently because boolean Series cannot hold anything other than True or False.

jorisvandenbossche · 2016-08-27T15:09:28Z

OK, so probably my question should be categorized in the naive category :-)
I see that this is the same as what scipy.sparse does, so seems like a sensible default then.

jorisvandenbossche · 2016-08-27T15:10:24Z

doc/source/sparse.rst

+
+Sparse data should have the same dtype as its dense representation. Currently,
+``float64``, ``int64`` and ``bool`` dtypes are supported. Depending on the original
+dtype, ``fill_value`` default changes:


Can you add a note here somewhere that int and bool support was only added in 0.19?

jreback · 2016-08-27T15:12:12Z

joris, your example already works: you can have any values you want as actual values (both True and False). The fill value is the missing-value indicator used when I need to densify (it's the default).

so this is not a conceptual change at all just a change to keep dtype consistency

jorisvandenbossche · 2016-08-27T15:18:00Z

@jreback I was looking at the to_sparse examples. So the fill_value is also used to convert from dense to sparse, which means the output that you see there (e.g. in the case of pd.Series([1, 0, 0]).to_sparse()) has changed (previously that was a block length of 3, now of 1). But no problem, I understand that the actual behaviour you want has not changed.
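
To make the block-length difference concrete, here is a small plain-Python sketch (not pandas code; `sparse_blocks` is a hypothetical helper) that lists the runs of stored values as (start, length) pairs for a given fill value.

```python
import math


def sparse_blocks(values, fill_value):
    """Hypothetical helper: (start, length) of each run of non-fill values."""
    def is_fill(v):
        # NaN is never equal to itself, so test for it explicitly.
        if isinstance(fill_value, float) and math.isnan(fill_value):
            return isinstance(v, float) and math.isnan(v)
        return v == fill_value

    blocks = []
    start = None
    for i, v in enumerate(values):
        if not is_fill(v):
            if start is None:
                start = i          # a new run of stored values begins
        elif start is not None:
            blocks.append((start, i - start))
            start = None
    if start is not None:          # close a run that reaches the end
        blocks.append((start, len(values) - start))
    return blocks


print(sparse_blocks([1, 0, 0], math.nan))  # [(0, 3)]: every value is stored
print(sparse_blocks([1, 0, 0], 0))         # [(0, 1)]: only the leading 1 is stored
```

With a NaN fill value all three entries are stored as one block of length 3; with a fill value of 0 only the first entry is stored, giving one block of length 1.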

jorisvandenbossche · 2016-08-27T15:19:01Z

@jreback Is this PR otherwise OK to merge for you, Jeff? (it's closing a lot of issues for 0.19.0 :-))

jorisvandenbossche · 2016-08-29T12:51:01Z

@sinhrks Can you update the docstrings for SparseDataFrame, SparseSeries and SparseArray? They all still mention the fact that only floats are supported or that nan is the default fill value.

jorisvandenbossche · 2016-08-31T07:57:46Z

@sinhrks Thanks a lot!

jorisvandenbossche · 2016-08-31T11:08:08Z

@sinhrks appveyor started failing (some int dtype issues):

======================================================================
FAIL: test_append_zero (pandas.sparse.tests.test_list.TestSparseList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\sparse\tests\test_list.py", line 64, in test_append_zero
    tm.assert_sp_array_equal(sparr, SparseArray(arr, fill_value=0))
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1392, in assert_sp_array_equal
    assert_numpy_array_equal(left.sp_values, right.sp_values)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1083, in assert_numpy_array_equal
    assert_attr_equal('dtype', left, right, obj=obj)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 878, in assert_attr_equal
    left_attr, right_attr)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1018, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: numpy array are different
Attribute "dtype" are different
[left]:  int64
[right]: int32

sinhrks · 2016-09-01T01:33:05Z

@jorisvandenbossche thx for pointing out, will fix.

sinhrks added the Enhancement, Dtype Conversions (unexpected or buggy dtype conversions), and Sparse (sparse data type) labels Jul 30, 2016

sinhrks added this to the 0.19.0 milestone Jul 30, 2016

sinhrks force-pushed the sparse_dtype3 branch from 0c0c38d to e239ec7 Compare July 30, 2016 10:28

This was referenced Aug 1, 2016

ENH: add sparse op for int64 dtypes #13848

Closed

ENH: sparse astype now supports int64 and bool #13900

Closed

sinhrks force-pushed the sparse_dtype3 branch 2 times, most recently from 3ab5f3e to d500761 Compare August 4, 2016 11:48

sinhrks force-pushed the sparse_dtype3 branch 7 times, most recently from b49c1c8 to 4a7c84b Compare August 6, 2016 19:29

sinhrks force-pushed the sparse_dtype3 branch 2 times, most recently from 6c4e0ee to 4eacbec Compare August 8, 2016 11:21

This was referenced Aug 8, 2016

BUG: multi-type SparseDataFrame fixes and improvements #13917

Closed

BLD: Fix sparse warnings #13942

Closed

jreback pushed a commit that referenced this pull request Aug 9, 2016

BLD: Fix sparse warnings

ae26ec7

closes #13942 xref #13849

sinhrks force-pushed the sparse_dtype3 branch 6 times, most recently from c334402 to 21861f0 Compare August 16, 2016 22:37

sinhrks force-pushed the sparse_dtype3 branch from 9ba7098 to c2a3f80 Compare August 20, 2016 20:57

sinhrks force-pushed the sparse_dtype3 branch from c2a3f80 to f125cc6 Compare August 21, 2016 22:04

jreback reviewed Aug 25, 2016

sinhrks force-pushed the sparse_dtype3 branch from f125cc6 to d8cf411 Compare August 25, 2016 21:20

jorisvandenbossche reviewed Aug 27, 2016
View reviewed changes

sinhrks force-pushed the sparse_dtype3 branch 2 times, most recently from c040583 to 38c6661 Compare August 29, 2016 06:56

ENH: Sparse dtypes

341585a

sinhrks force-pushed the sparse_dtype3 branch from 38c6661 to 341585a Compare August 29, 2016 22:58

jorisvandenbossche merged commit b6d3a81 into pandas-dev:master Aug 31, 2016

sinhrks deleted the sparse_dtype3 branch September 1, 2016 01:33

jnothman mentioned this pull request Sep 6, 2016

Sparse becomes float under fancy indexing #14166

Closed

kawochen mentioned this pull request Dec 5, 2016

BUG: Sparse master issue #10627

Closed

18 tasks
